Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

estuary-cdk: allow a cron expression to re-initialize a binding's state #2437

Merged
merged 4 commits into from
Feb 24, 2025

Conversation

Alex-Bair
Copy link
Member

@Alex-Bair Alex-Bair commented Feb 21, 2025

Description:

Background

For a couple capture connectors, bindings need backfilled on some schedule. So far, we've been manually backfilling these bindings every day/week. It would be nice if a resource config level setting existed to automatically backfill on a schedule based on a cron expression, like described in this issue for our batch SQL captures.

Scope

This PR's scope includes:

  • extending the existing ResourceConfig with the ResourceConfigWithSchedule class, allowing resources to specify a schedule for their state to be re-initialized (i.e backfill the binding). This schedule must either be an empty string or a valid cron expression accepted by pycron.
  • adding the last_initialized property to all bindings' states for all connectors.
  • updating source-braintree-native to use ResourceConfigWithSchedule for its incremental bindings

Additional Details

Since the resource config is the same for all bindings in a connector, a connector that needs to use ResourceConfigWithSchedule for at least one binding must use it for all bindings.

pycron was chosen as the cron package for a few reasons:

  • The typically clear choice for a Python cron package croniter may be unpublished from PyPI in the near future due to the maintainer's concerns around EU CRA laws & liabilities. It is not clear that a different maintainer will take over & keep croniter available, and beyond forking & maintaining our own croniter repo, I don't see a long term solution to using that package in our code.
  • pycron has the minimal functionality we need & appears to be somewhat actively maintained & used by others.

Scheduled backfills take precedence over ongoing backfills. If users want backfills to complete before the next scheduled backfill starts, they'll need to update the binding's schedule to have longer intervals between backfills.

Workflow steps:

Schedules are configured per binding in the resource config using a valid cron expression. If a binding's state should be re-initialized per its schedule, it will do so on the next connector restart.

Documentation links affected:

Docs for source-braintree-native should be updated to reflect:

  • The new schedule setting for bindings.
  • That users do not need to contact us to set up scheduled backfills for incremental streams that aren't transactions. These are enabled by default for the connector & can be updated by the users themselves if needed.

Notes for reviewers:

Tested on a local stack. Confirmed:

  • Adding state.last_initialized to existing captures works & does not cause connector failures.
  • Migrating from ResourceConfig to ResourceConfigWithSchedule works & does not cause connector failures.
  • Setting a schedule causes the binding's state to be re-initialized per that specific schedule.
  • Disabling a binding/task before a schedule matches then re-enabling after the schedule no longer matches causes the binding's state to be re-initialized. Ex: If a binding's schedule says it should be backfilled at 11:00 AM and the binding/task is disabled at 10:59 AM then is re-enabled at 11:01 AM, the binding is still backfilled since it "missed" a scheduled backfill.
  • Scheduled backfills occur even if there's an ongoing backfill.

This change is Reviewable

@Alex-Bair Alex-Bair changed the title estuary-cdk: extend ResourceConfig to allow a cron expression to re-initialize a binding's state on a schedule estuary-cdk: allow a cron expression to re-initialize a binding's state Feb 21, 2025
@Alex-Bair Alex-Bair force-pushed the bair/estuary-cdk-scheduled-backfills-feat branch 2 times, most recently from 078f402 to 02ec872 Compare February 21, 2025 18:11
@Alex-Bair Alex-Bair marked this pull request as ready for review February 21, 2025 18:37
@williamhbaker
Copy link
Member

Scheduled backfills take precedence over ongoing backfills

First I'll recognize that I'm pretty sure my initial suggestion was this ^, so I'm totally backtracking. But I am wondering if you think it would be much effort to make it work the other way, where an ongoing backfill is allowed to complete first? As I've thought about it more than might be a better experience, in cases where a backfill takes so long it would be best to at least get it done before it starts again. But if it would be a lot more effort, the way it is in the PR now is probably fine, and would match our backfill script behavior anyway.

@Alex-Bair
Copy link
Member Author

First I'll recognize that I'm pretty sure my initial suggestion was this ^, so I'm totally backtracking. But I am wondering if you think it would be much effort to make it work the other way, where an ongoing backfill is allowed to complete first? As I've thought about it more than might be a better experience, in cases where a backfill takes so long it would be best to at least get it done before it starts again. But if it would be a lot more effort, the way it is in the PR now is probably fine, and would match our backfill script behavior anyway.

No, I actually think it'll be pretty easy to update. It'll just be an extra condition to check if state.backfill is None before we kick off a scheduled backfill. I can update that.

@williamhbaker
Copy link
Member

Another high-level thought: Other than the initial connector startup, I'm not seeing anywhere that it would know that it has crossed the threshold for triggering a new backfill. Connectors using the CDK restart every 24 hours so it would check at least that often, but if an interval more like "every 2 hours" is selected that won't work very well, nor will it work well for a case like "every Friday at 9 PM", since the connector could do its 24-hour restart at 8:59 PM and then not restart again until almost 24 hours later.

I think we'd need to do something like upon opening, start up a coroutine that will set the stopping event at the next schedule time for the backfill. Then when the connector starts back up it will trigger the reinitialization logic. This feels like it may conflict with the goal of allowing existing backfills to complete, but I'm not totally sure about that (do on-going backfills check stopping?).

@Alex-Bair
Copy link
Member Author

I think we'd need to do something like upon opening, start up a coroutine that will set the stopping event at the next schedule time for the backfill. Then when the connector starts back up it will trigger the reinitialization logic. This feels like it may conflict with the goal of allowing existing backfills to complete, but I'm not totally sure about that (do on-going backfills check stopping?).

That sounds like the right direction. It does feel like it conflicts a little with the goal of allowing existing backfills to finish, but we can avoid reinitializing after start up if there's still a backfill that needs to complete. I'll work on doing something like that.

I think whether or not on-going backfills check the stopping event depends on how each binding's fetch_page function is implemented. If fetch_page only fetches a single page in a single invocation and regularly exits this async for loop, then stopping will get checked between pages. But if fetch_page tries to paginate through all results in a single invocation, stopping won't be checked & the connector doesn't know it can gracefully stop & restart. I know there are at least a handful of fetch_page functions like this, and they'd need updated if we want to allow them to be restarted gracefully with stopping. I'd anticipate that even if a single ongoing backfill's fetch_page function tries to paginate through all results in a single invocation, then it'll never check stopping & the connector would only restart every 24 hours & for inferred schema updates.

That said, I still think it makes sense to use stopping in this manner & the solution for those fetch_page functions is to update them to get fewer/as few as possible pages in a single invocation.

@Alex-Bair Alex-Bair force-pushed the bair/estuary-cdk-scheduled-backfills-feat branch from 02ec872 to 221f984 Compare February 21, 2025 23:13
@Alex-Bair
Copy link
Member Author

I've updated per our discussion. Changes include:

  • Moving the cron logic to a dedicated file & making a more useful next_fire function that returns the next date that satisfies the cron expression.
  • Spawning a scheduled_stop coroutine to set stopping at the soonest, future scheduled re-initialization.

I've confirmed:

  • scheduled_stop waits for the soonest future scheduled re-initialization time then sets stopping. This works as expected with a single schedule & multiple schedules.

I kept the "scheduled backfills take precedence over ongoing backfills" behavior. Like you mentioned, the connector would need to somehow know when an ongoing backfill completes to faithfully keep to a schedule & begin a new backfill if it skipped any scheduled ones. That scope could be done in the future if needed.

It's also worth noting that although scheduled_stop does a good job of stopping the connector at the right time, the connector doesn't start back up right away. I suspect the duration between stop & start is based on some frequency external to the connector (probably the control plane?).

Copy link
Member

@williamhbaker williamhbaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

This commit gets a few failing capture snapshots to pass before I
make changes that could impact CDK connectors.
…re-initializing a resource's state on a schedule

For a couple capture connectors, bindings need backfilled on some schedule.
So far, we've been manually backfilling these bindings every day/week,
and it would be nice if a resource config level setting existed to
automatically backfill.

This commit extends the existing `ResourceConfig` with the
`ResourceConfigWithSchedule` class, allowing resources to specify a
schedule for their state to be re-initialized (i.e backfill the binding).
This schedule must either be an empty string or a valid cron expression
accepted by `pycron`.

In order to know when a binding's state was last initialized, the
`last_initialized` property has been added to all bindings' states.

Since the resource config is the same for all bindings in a
connector, a connector that needs to use `ResourceConfigWithSchedule`
for at least one binding must use it for all bindings.

`pycron` was chosen as the cron package for a few reasons:
- The typically clear choice for a Python cron package `croniter` may be
  unpublished from PyPI in the near future due to the maintainer's
  concerns around EU CRA laws & liabilities. It is not clear that a
  different maintainer will take over & keep `croniter` available, and
  beyond forking & maintaining our own `croniter` repo, I don't see a
  longterm solution to using that package in our code.
- `pycron` has the minimal functionality we need & appears to be somewhat
  actively maintained & used by others.

If there are bindings with cron schedules, we spawn a coroutine to stop
the connector at the soonest future scheduled re-initialzation time to
adhere to the bindings' schedules relatively closely.
Since the CDK now uses `pycron`, the lock files for connectors using
estuary-cdk need updated to include `pycron` as a dependency.
…use `ResourceConfigWithSchedule`

Due to Braintree's limitation of only allowing queries for
`credit_card_verifications`, `customers`, `disputes`, and `subscriptions`
to filter based on their created_at datetime, we can't incrementally
capture updates to previously created data. We've been manually
backfilling these bindings every week, but now we can use the new
`ResourceConfigWithSchedule` resource config to set a schedule for
these bindings to automatically rebackfill.

I set the default schedule for these bindings to backfill every Friday
at 8:00 PM UTC, which seems like a reasonable setting for most use cases
that would prefer to perform possibly lengthy backfills over the weekend.
@Alex-Bair Alex-Bair force-pushed the bair/estuary-cdk-scheduled-backfills-feat branch from 221f984 to ba1594a Compare February 24, 2025 14:46
@Alex-Bair Alex-Bair merged commit 1c7b30d into main Feb 24, 2025
83 of 87 checks passed
@Alex-Bair Alex-Bair deleted the bair/estuary-cdk-scheduled-backfills-feat branch February 24, 2025 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants